Bayesian Text Categorization

Authors

Heri Ramampiaro

Arild Brandrud Næss

Abstract

Natural language processing is an interdisciplinary field of research that studies the problems and possibilities of automated generation and understanding of natural human languages. Text categorization is a central subfield of natural language processing. Automatically assigning categories to digital texts has a wide range of applications in today's information society, from filtering spam to creating web hierarchies and digital newspaper archives. It is a discipline that lends itself more naturally to machine learning than to knowledge engineering; statistical approaches to text categorization are therefore a promising field of inquiry. We provide a survey of the state of the art in text categorization, presenting the most widespread methods in use and placing particular emphasis on support vector machines, an optimization algorithm that has emerged as the benchmark method in text categorization in the past ten years. We then turn our attention to Bayesian logistic regression, a fairly new and largely unstudied method in text categorization. We show that this method has certain similarities to the support vector machine method but also differs from it in crucial respects. Notably, Bayesian logistic regression provides us with a statistical framework. It can be claimed to be more modular, in the sense that it is more open to modification and supplementation by other statistical methods, whereas the support vector machine method remains more of a black box. We present the results of thorough testing of the BBR toolkit for Bayesian logistic regression on three separate data sets. We demonstrate which of BBR's parameters are important, and we show that its results compare favorably to those of the SVMlight toolkit for support vector machines. We also present two extensions to the BBR toolkit. One attempts to incorporate domain knowledge by way of the prior probability distributions of single words; the other tries to make use of uncategorized documents to boost learning accuracy.
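
As a rough illustration of how a per-word prior can shape a Bayesian logistic regression classifier, the sketch below computes a MAP estimate under independent zero-mean Gaussian priors, one variance per word. It is not BBR's code or the authors' extension: the toy vocabulary, the documents, the variances, and the function names `fit_map_logistic` and `predict_proba` are all hypothetical, and plain gradient descent stands in for whatever optimizer a real toolkit uses.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, x):
    # Estimated probability that document x belongs to the positive class.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def fit_map_logistic(X, y, prior_var, lr=0.1, epochs=2000):
    """MAP weights for logistic regression with a zero-mean Gaussian
    prior of variance prior_var[j] on weight j (assumed prior form)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        # Gradient of the negative log prior: w_j / variance_j.
        grad = [w[j] / prior_var[j] for j in range(d)]
        # Gradient of the negative log likelihood.
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            for j in range(d):
                grad[j] += err * xi[j]
        # Gradient step on the objective scaled by 1/n.
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

# Hypothetical toy data: binary word indicators for ["free", "meeting", "the"].
X = [[1, 0, 1], [1, 0, 1], [0, 1, 1], [0, 1, 1]]
y = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

# Same prior variance for every word.
w_flat = fit_map_logistic(X, y, prior_var=[1.0, 1.0, 1.0])
# A much tighter prior on "free", pulling its weight toward zero.
w_tight = fit_map_logistic(X, y, prior_var=[0.1, 1.0, 1.0])

print("flat prior weights: ", w_flat)
print("tight prior on 'free':", w_tight)
```

Shrinking the prior variance of one word limits how far the data can move that word's weight from zero, which is one simple way that beliefs about individual words could enter the model.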

Similar resources

An Improved Algorithm of Bayesian Text Categorization

Text categorization is a fundamental methodology of text mining and has been a hot topic in data mining and web mining research in recent years. It plays an important role in building traditional information retrieval, web indexing architecture, web information retrieval, and so on. This paper presents an improved algorithm of text categorization that combines the feature weighting technique with...

Sparse Bayesian Classifiers for Text Categorization (U)

(U) This paper empirically compares the performance of different Bayesian models for text categorization. In particular we examine so-called “sparse” Bayesian models that explicitly favor simplicity. We present empirical evidence that these models retain good predictive capabilities while offering significant computational advantages.

Sparse Bayesian Classifiers for Text Categorization

This paper empirically compares the performance of different Bayesian models for text categorization. In particular we examine so-called “sparse” Bayesian models that explicitly favor simplicity. We present empirical evidence that these models retain good predictive capabilities while offering significant computational advantages.

Hierarchical Bayesian Clustering for Automatic Text Classification

Text classification, the grouping of texts into several clusters, has been used as a means of improving both the efficiency and the effectiveness of text retrieval/categorization. In this paper we propose a hierarchical clustering algorithm that constructs a set of clusters having the maximum Bayesian posterior probability, the probability that the given texts are classified into clusters. We c...

Bayesian Nets in Syntactic Categorization of Novel Words

This paper presents an application of a Dynamic Bayesian Network (DBN) to the task of assigning Part-of-Speech (PoS) tags to novel text. This task is particularly challenging for non-standard corpora, such as Internet lingo, where a large proportion of words are unknown. Previous work reveals that PoS tags depend on a variety of morphological and contextual features. Representing these dependen...
